{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "635d8ebb",
   "metadata": {},
   "source": [
    "# Arxiv Loader\n",
    "\n",
    "- Author: [Sunyoung Park (architectyou)](https://github.com/architectyou)\n",
    "- Peer Review : [ppakyeah](https://github.com/ppakyeah)\n",
    "- Proofread : [JaeJun Shim](https://github.com/kkam-dragon)\n",
    "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
    "\n",
    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/10-ArxivLoader.ipynb)[![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/10-ArxivLoader.ipynb)\n",
    "\n",
    "\n",
    "## Overview\n",
    "\n",
    "[```arXiv```](https://arxiv.org/) is an open access archive for 2 million scholarly articles in the fields of physics, \n",
    "\n",
    "mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems \n",
    "\n",
    "science, and economics.\n",
    "\n",
    "[API Documentation](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.arxiv.ArxivLoader.html#langchain_community.document_loaders.arxiv.ArxivLoader)\n",
    "\n",
    "\n",
    "To access the Arxiv document loader, you need to install ```arxiv```, ```PyMuPDF``` and ```langchain-community``` integration packages.\n",
    "\n",
    "```PyMuPDF``` converts PDF files downloaded from arxiv.org into text format.\n",
    "\n",
    "\n",
    "### Table of Contents\n",
    "\n",
    "- [Overview](#overview)\n",
    "- [Environment Setup](#environment-setup)\n",
    "- [Arxiv Loader Instantiate](#arxiv-loader-instantiate)\n",
    "- [Load](#load)\n",
    "- [Lazy Load](#lazy-load)\n",
    "- [Asynchronous Load](#asynchronous-load)\n",
    "- [Use summaries of articles as docs](#use-summaries-of-articles-as-docs)\n",
    "\n",
    "### References\n",
    "\n",
    "- [ArxivLoader API Documentation](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.arxiv.ArxivLoader.html#langchain_community.document_loaders.arxiv.ArxivLoader)\n",
    "- [Arxiv API Acess Documentation](https://info.arxiv.org/help/api/index.html)\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c6c7aba4",
   "metadata": {},
   "source": [
    "## Environment Setup\n",
    "\n",
    "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n",
    "\n",
    "**[Note]**\n",
    "- ```langchain-opentutorial``` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n",
    "- You can checkout the [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "21943adb",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture --no-stderr\n",
    "%pip install langchain-opentutorial"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "f25ec196",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.3.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.3.1\u001b[0m\n",
      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "# Install required packages\n",
    "from langchain_opentutorial import package\n",
    "\n",
    "package.install(\n",
    "    [\n",
    "        \"langchain-community\",\n",
    "        \"arxiv\",\n",
    "        \"pymupdf\",\n",
    "    ],\n",
    "    verbose=False,\n",
    "    upgrade=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa00c3f4",
   "metadata": {},
   "source": [
    "## Arxiv-Loader-Instantiate\n",
    "\n",
    "You can make arxiv loader instance to load documents from arxiv.org.\n",
    "\n",
    "Initialize with search query to find documents in the Arixiv.org.\n",
    "Supports all arguments of ```ArxivAPIWrapper``` ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "1be7d981",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import ArxivLoader\n",
    "\n",
    "### Enter the research topic you want to search for in the Query parameter\n",
    "loader = ArxivLoader(\n",
    "    query=\"Chain of thought\",\n",
    "    load_max_docs=2,  # max number of documents\n",
    "    load_all_available_meta=True,  # load all available metadata\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a073f0b5",
   "metadata": {},
   "source": [
    "### Load\n",
    "\n",
    "Use ```Load``` method to load documents from arxiv.org with ```ArxivLoader``` instance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "69cb77da",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Contrastive Chain-of-Thought Prompting\n",
      "Yew Ken Chia∗1,\n",
      "Guizhen Chen∗1, 2\n",
      "Luu Anh Tuan2\n",
      "Soujanya Pori\n",
      "{'Published': '2023-11-15', 'Title': 'Contrastive Chain-of-Thought Prompting', 'Authors': 'Yew Ken Chia, Guizhen Chen, Luu Anh Tuan, Soujanya Poria, Lidong Bing', 'Summary': 'Despite the success of chain of thought in enhancing language model\\nreasoning, the underlying process remains less well understood. Although\\nlogically sound reasoning appears inherently crucial for chain of thought,\\nprior studies surprisingly reveal minimal impact when using invalid\\ndemonstrations instead. Furthermore, the conventional chain of thought does not\\ninform language models on what mistakes to avoid, which potentially leads to\\nmore errors. Hence, inspired by how humans can learn from both positive and\\nnegative examples, we propose contrastive chain of thought to enhance language\\nmodel reasoning. Compared to the conventional chain of thought, our approach\\nprovides both valid and invalid reasoning demonstrations, to guide the model to\\nreason step-by-step while reducing reasoning mistakes. To improve\\ngeneralization, we introduce an automatic method to construct contrastive\\ndemonstrations. Our experiments on reasoning benchmarks demonstrate that\\ncontrastive chain of thought can serve as a general enhancement of\\nchain-of-thought prompting.', 'entry_id': 'http://arxiv.org/abs/2311.09277v1', 'published_first_time': '2023-11-15', 'comment': None, 'journal_ref': None, 'doi': None, 'primary_category': 'cs.CL', 'categories': ['cs.CL'], 'links': ['http://arxiv.org/abs/2311.09277v1', 'http://arxiv.org/pdf/2311.09277v1']}\n"
     ]
    }
   ],
   "source": [
    "# Print the first document's content and metadata\n",
    "docs = loader.load()\n",
    "print(docs[0].page_content[:100])\n",
    "print(docs[0].metadata)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "545cd3e8",
   "metadata": {},
   "source": [
    "- If ```load_all_available_meta``` is False, only partial metadata is displayed, not the complete metadata."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08828d86",
   "metadata": {},
   "source": [
    "### Lazy Load\n",
    "\n",
    "When loading large amounts of documents, If you can perform downstream tasks on a subset of all loaded documents, you can ```lazy_load``` documents one at a time to minimize memory usage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "e9df325a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Contrastive Chain-of-Thought Prompting\n",
      "Yew Ken Chia∗1,\n",
      "Guizhen Chen∗1, 2\n",
      "Luu Anh Tuan2\n",
      "Soujanya Pori\n",
      "{'Published': '2023-11-15', 'Title': 'Contrastive Chain-of-Thought Prompting', 'Authors': 'Yew Ken Chia, Guizhen Chen, Luu Anh Tuan, Soujanya Poria, Lidong Bing', 'Summary': 'Despite the success of chain of thought in enhancing language model\\nreasoning, the underlying process remains less well understood. Although\\nlogically sound reasoning appears inherently crucial for chain of thought,\\nprior studies surprisingly reveal minimal impact when using invalid\\ndemonstrations instead. Furthermore, the conventional chain of thought does not\\ninform language models on what mistakes to avoid, which potentially leads to\\nmore errors. Hence, inspired by how humans can learn from both positive and\\nnegative examples, we propose contrastive chain of thought to enhance language\\nmodel reasoning. Compared to the conventional chain of thought, our approach\\nprovides both valid and invalid reasoning demonstrations, to guide the model to\\nreason step-by-step while reducing reasoning mistakes. To improve\\ngeneralization, we introduce an automatic method to construct contrastive\\ndemonstrations. Our experiments on reasoning benchmarks demonstrate that\\ncontrastive chain of thought can serve as a general enhancement of\\nchain-of-thought prompting.', 'entry_id': 'http://arxiv.org/abs/2311.09277v1', 'published_first_time': '2023-11-15', 'comment': None, 'journal_ref': None, 'doi': None, 'primary_category': 'cs.CL', 'categories': ['cs.CL'], 'links': ['http://arxiv.org/abs/2311.09277v1', 'http://arxiv.org/pdf/2311.09277v1']}\n"
     ]
    }
   ],
   "source": [
    "docs = []\n",
    "docs_lazy = loader.lazy_load()\n",
    "\n",
    "# append docs to docs list\n",
    "# async variant : docs_lazy = await loader.lazy_load()\n",
    "\n",
    "for doc in docs_lazy:\n",
    "    docs.append(doc)\n",
    "\n",
    "print(docs[0].page_content[:100])\n",
    "print(docs[0].metadata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "1060a864",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0abd3e83",
   "metadata": {},
   "source": [
    "### Asynchronous Load\n",
    "\n",
    "Use ```aload``` method to load documents from arxiv.org asynchronously."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "c6eb92e7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Contrastive Chain-of-Thought Prompting\n",
      "Yew Ken Chia∗1,\n",
      "Guizhen Chen∗1, 2\n",
      "Luu Anh Tuan2\n",
      "Soujanya Pori\n",
      "{'Published': '2023-11-15', 'Title': 'Contrastive Chain-of-Thought Prompting', 'Authors': 'Yew Ken Chia, Guizhen Chen, Luu Anh Tuan, Soujanya Poria, Lidong Bing', 'Summary': 'Despite the success of chain of thought in enhancing language model\\nreasoning, the underlying process remains less well understood. Although\\nlogically sound reasoning appears inherently crucial for chain of thought,\\nprior studies surprisingly reveal minimal impact when using invalid\\ndemonstrations instead. Furthermore, the conventional chain of thought does not\\ninform language models on what mistakes to avoid, which potentially leads to\\nmore errors. Hence, inspired by how humans can learn from both positive and\\nnegative examples, we propose contrastive chain of thought to enhance language\\nmodel reasoning. Compared to the conventional chain of thought, our approach\\nprovides both valid and invalid reasoning demonstrations, to guide the model to\\nreason step-by-step while reducing reasoning mistakes. To improve\\ngeneralization, we introduce an automatic method to construct contrastive\\ndemonstrations. Our experiments on reasoning benchmarks demonstrate that\\ncontrastive chain of thought can serve as a general enhancement of\\nchain-of-thought prompting.', 'entry_id': 'http://arxiv.org/abs/2311.09277v1', 'published_first_time': '2023-11-15', 'comment': None, 'journal_ref': None, 'doi': None, 'primary_category': 'cs.CL', 'categories': ['cs.CL'], 'links': ['http://arxiv.org/abs/2311.09277v1', 'http://arxiv.org/pdf/2311.09277v1']}\n"
     ]
    }
   ],
   "source": [
    "docs = await loader.aload()\n",
    "print(docs[0].page_content[:100])\n",
    "print(docs[0].metadata)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d8d5ee42",
   "metadata": {},
   "source": [
    "## Use Summaries of Articles as Docs\n",
    "\n",
    "Use ```get_summaries_as_docs``` method to get summaries of articles as docs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "e004263c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Large language models (LLMs) have demonstrated impressive reasoning\n",
      "abilities, but they still strugg\n",
      "{'Entry ID': 'http://arxiv.org/abs/2410.13080v1', 'Published': datetime.date(2024, 10, 16), 'Title': 'Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models', 'Authors': 'Linhao Luo, Zicheng Zhao, Chen Gong, Gholamreza Haffari, Shirui Pan'}\n"
     ]
    }
   ],
   "source": [
    "from langchain_community.document_loaders import ArxivLoader\n",
    "\n",
    "loader = ArxivLoader(\n",
    "    query=\"reasoning\"\n",
    ")\n",
    "\n",
    "docs = loader.get_summaries_as_docs()\n",
    "print(docs[0].page_content[:100])\n",
    "print(docs[0].metadata)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "langchain-kr-lwwSZlnu-py3.11",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
